INTERPRETING STANDARDIZED TEST SCORES
Dallis Perry

Student Counseling Bureau, University of Minnesota 

Uses of Tests 

Standardized tests are used to assist in making a wide variety of
educational decisions:

Which students should be selected for Special Program A? 

What educational and vocational plans are reasonable for Student B? 

For what level of instruction in mathematics is Student C ready? 

Has Class D made the expected progress in science? 

How successful is the new social studies curriculum in School E? 

Test scores provide just one of many kinds of information that must be
evaluated and integrated to answer these questions. The ways in which such
information is used in educational administration, instruction, and guidance
are the subject of such disciplines as educational evaluation, teaching
methodology, and counseling and are beyond the scope of this discussion; but,
before we use test results in any way, we must understand what information is
contained in the test scores: what it is they do and do not tell us.

Cronbach's (1970) definition of a test as "a systematic procedure for
observing a person's behavior and describing it with the aid of a numerical
scale or category system" is perhaps as satisfactory as any. The tests with
which we are concerned are standardized tests, standardized both with respect
to the presentation of the stimuli (items) that elicit the behavior that is
observed and with respect to the reference data by which the numerical results
are interpreted. The score that results directly from a test operation is
ordinarily an arbitrary and quite meaningless figure. A considerable portion
of test technology is concerned with procedures for transforming such
"raw" scores to scales that "build in" significance through their relation 
to one or more kinds of reference information. Two general classes of trans- 
formed scores are "norm-referenced" scores, which indicate relative standing 
in comparison with a specified reference group, and "criterion-referenced" 
scores, which relate test performance to the kind of behavior exhibited by, or
expected from, the examinee. Underlying the interpretation of both kinds of
scores are the concepts of validity and accuracy of measurement and the
assumption that the tests have been presented to students in a standard manner. The
following sections discuss test administration circumstances and the concepts 
of measurement accuracy, validity, norm-reference, and criterion-reference 
as they influence the interpretation of standardized test scores. 


Test Administration 

Fundamental to a standardized test is the equivalence of test content 
from one student to another, which makes possible comparison of scores. It
is essential that this standardization not be compromised by special instruc- 
tions, assistance, or failures in test security that may effectively alter 
the content in unknown ways for some students. Testing conditions cannot, 
of course, be identical for all examinees, but they should be comparable in 
every way possible. Because most educational tests are regarded as measures 
of maximum performance, each student must have an opportunity to do his best. 
Satisfactory physical conditions of lighting, heating, ventilation, space,
and work surfaces are assumed, as well as rigid adherence to specified
directions and time limits. Equally important, and much more difficult to control,
are the internal conditions that each student brings to the test. If a test
score is to represent maximum performance, the effort and therefore the moti- 
vation to do well on the test must be comparable to that expected in the 
situations to which the test score is related. Test manuals and administration 
directions give little attention to pretest preparation and instruction of 
examinees. A clear explanation of the purpose and significance of the tests, 
without resorting to exhortation, is preferable to presentation of the tests 
as a required but unexplained task. Motivation cannot be completely standar- 
dized, of course, and the counselor or teacher with specific knowledge of each 
student as well as of the testing situation can best judge whether a given test 
score should be accepted at face value, regarded with extra caution, or dis- 
regarded completely because of the circumstances in which it was obtained. 

Accuracy of Measurement 

No single test score is completely representative of the "universe" of
behavior for a person. A test score is based on a sample of behavior, and
scores based on different samples can be expected to vary. Interpretation
of the score must take into account the amount of such variation to be expected 
under given circumstances. This variation is usually expressed as "error of 
measurement", considered to be the difference between the observed score and 
a hypothetical "true" score consisting of the mean of a very large number of 
measurements of the same kind on the same person. 

Standard Error of Measurement 

The standard deviation of the distribution of measurements on a person, 
of which the person's true score is the mean (or equivalently the standard 
deviation of the differences between true and observed scores), is called the 
standard error of measurement (s.e.m.) for that person. Although the s.e.m. 
on a test varies from one person to another, in practice the average s.e.m.
over a sample of persons is determined as an estimate of the s.e.m. on the 
test for each person. 

The s.e.m. of a test indicates the extent to which a person's scores
obtained by repeated measurement of the same kind would vary around the
person's true score. It may be pictured as shown in Fig. 1. Within the range
of ±1 s.e.m. from a person's true score will fall 68% of his obtained scores,
and 95% will fall within ±2 s.e.m. If the s.e.m. is 3, for example, the
probability is 68% that any observed score is within 3 points of the true score.

Figure 1 about here

The s.e.m. of a test is important because it emphasizes that an observed 
test score is just an estimate and not a precisely determined number, and at 
the same time it quantifies the dependability of the score. Test scores are 
sometimes reported as ranges or bands, typically extending 1 s.e.m. above and 
below the observed score, with or without the observed score indicated. Although 
the interpretation of such ranges is difficult to specify precisely in probability 
terms, they have the advantage of emphasizing to users the limits of score depend- 
ability. 
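
The 68% and 95% figures above can be checked with a short sketch. It assumes that errors of measurement are normally distributed around the true score, which is the usual reading of a picture like Fig. 1 but is an assumption here; the function names `prob_within` and `score_band` are illustrative only.

```python
import math

def prob_within(k_sem):
    """Probability that an observed score lies within k s.e.m. of the true
    score, ASSUMING normally distributed errors of measurement."""
    return math.erf(k_sem / math.sqrt(2))

def score_band(observed, sem, k=1.0):
    """Band extending k s.e.m. below and above an observed score."""
    return (observed - k * sem, observed + k * sem)

print(round(prob_within(1), 2))  # about 0.68
print(round(prob_within(2), 2))  # about 0.95

# Hypothetical observed score of 50 on a test with s.e.m. = 3.
low, high = score_band(50, 3)
print(low, high)
```

A band built around an observed score is not the same thing as the distribution of observed scores around a true score, which is one reason the probability interpretation of reported bands is hard to state precisely.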

In evaluating scores on a test with reference to its s.e.m., two points 
should be considered: 

1. The reported s.e.m. is an estimate of the average s.e.m. for all
persons who take the test. Individuals differ in their variability as well as
in their true scores, so the actual s.e.m. is not the same for all persons.
The s.e.m. of a well-constructed test should not be correlated with test scores,
but in practice persons near the extremes of a distribution are less likely to
be measured accurately than those near the middle. This situation may arise,
for example, if the test has insufficient "ceiling" so that differences among
the more able students cannot be detected, or if it is so difficult for some
students that they respond randomly or by excessive guessing.

2. Different procedures used to estimate the s.e.m. of a test ascribe
different sources of observed score variance to error. It is important to
keep in mind the sources of variance represented in the s.e.m., and, therefore,
the generalizability of the score. Internal consistency procedures
(Kuder-Richardson formulas, split-half, odd-even) or alternate form correlations
generally include as error that variance due to sampling of test content and 
that due to momentary factors that differentially influence performance during 
a single testing occasion. Factors that would differentially affect scores on 
another occasion are ascribed to "true" score variance. Retesting at a differ- 
ent time with the same instrument leads to the inclusion of differences due to 
testing occasions, but not differences due to content sampling, in the error 
variance estimate. 
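
As an illustration of an internal consistency procedure, the sketch below computes Kuder-Richardson formula 20, r = (k/(k-1))(1 - Σpq / S²), where k is the number of items, p and q are the proportions passing and failing each item, and S² is the variance of total scores. The response data are hypothetical, and the use of the population variance (dividing by n) is a convention chosen for this sketch.

```python
def kr20(responses):
    """Kuder-Richardson formula 20 for dichotomous (0/1) item scores.

    responses: one list of 0/1 item scores per examinee.
    Uses population variances (divide by n), a convention assumed here.
    """
    n = len(responses)
    k = len(responses[0])
    if n < 2 or k < 2:
        raise ValueError("need at least 2 examinees and 2 items")
    totals = [sum(person) for person in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    if var_total == 0:
        raise ValueError("total scores have no variance")
    sum_pq = 0.0
    for i in range(k):
        p = sum(person[i] for person in responses) / n  # proportion passing item i
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)

# Hypothetical responses of six examinees to a four-item test.
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 0],
]
print(round(kr20(data), 2))
```

Because the coefficient comes from a single testing occasion, differences between occasions do not enter its error term, which is the point made in item 2 above.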

Reliability Coefficients

As Fig. 1 implies, the error variance ordinarily is much smaller than the
total variance on a test. If it were not, if all the variance were error
variance, there would be no true score variance and the test would have no value.
Interpretation of the s.e.m. of a test depends in part on how much smaller than
total score variance it is. An s.e.m. of 3 has quite different significance
for a test with a standard deviation of 4 than for a test with a standard
deviation of 46. The variance of a group of test scores is composed of the error
variance plus the true score variance, or



$$ S^2_{\text{observed}} = S^2_{\text{true}} + S^2_{\text{error}} \qquad (1) $$

The relationship of these variances is usually expressed as the ratio of true
score variance to total score variance, called the reliability coefficient,

$$ r = \frac{S^2_{\text{true}}}{S^2_{\text{observed}}} \qquad (2) $$

Because true score variance, and therefore total observed variance, is a
function of the heterogeneity of the group being measured, a reliability coefficient
reflects both group and test characteristics, whereas the error component of
scores on a test (s.e.m. squared) is regarded as a characteristic of the test,
fixed for all groups. In interpretation of an individual test score the s.e.m.
most directly indicates the confidence that can be placed in the score, but
the stability of the score with respect to an entire group of scores, as
indicated by the reliability coefficient, also should be known. Given the standard
deviation of the group in question one can, from (1) and (2) above, compute
either s.e.m. or r from the other according to the familiar formulas

$$ \text{s.e.m.} = S_{\text{observed}} \sqrt{1 - r} \qquad (3) $$

$$ r = 1 - \frac{(\text{s.e.m.})^2}{S^2_{\text{observed}}} \qquad (4) $$

Internal consistency reliability of the Minnesota Scholastic Aptitude
Test (MSAT) was found to be .93 (Layton, no date), which indicates, according
to formulas (3) and (4), a s.e.m. about one-fourth as large ($\sqrt{.07} = .26$) as
the standard deviation of 13.8, or about 3.7. Referring to the MSAT norms in
Table 1 we find that, if, for example, a student's "true" score is at the 71st
percentile (RS=44), about two-thirds of the time in repeated testing his
observed MSAT score would be between the 63rd and 70th percentiles. He would
obtain a score below the 54th percentile less than 3% of the time.
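
The arithmetic of formulas (3) and (4) can be sketched as follows. The MSAT figures (reliability .93, standard deviation 13.8) and the comparison of standard deviations of 4 and 46 come from the text; the function names are illustrative only.

```python
import math

def sem_from_reliability(sd, r):
    """Formula (3): s.e.m. = S * sqrt(1 - r)."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - r)

def reliability_from_sem(sd, sem):
    """Formula (4): r = 1 - (s.e.m.)^2 / S^2."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return 1.0 - (sem ** 2) / (sd ** 2)

# MSAT example from the text: r = .93, S = 13.8.
msat_sem = sem_from_reliability(13.8, 0.93)
print(round(msat_sem, 1))  # 3.7

# An s.e.m. of 3 means very different things for S = 4 and S = 46.
print(round(reliability_from_sem(4, 3), 2))   # about 0.44
print(round(reliability_from_sem(46, 3), 3))  # about 0.996
```

The two functions are inverses of each other for a fixed standard deviation, which is a quick way to check a reported s.e.m. against a reported reliability coefficient.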

Validity

The most critical information underlying the interpretation of test 
scores is how well the scores measure the characteristic the test is being
used to measure, i.e., how valid is the test for the purpose to which it is 
being put. Because a test may be used for many purposes, it may have many 
validities and even several different kinds of validity. Different kinds of 
validity are generally classified into three categories: content validity,
criterion-related validity, and construct validity.

Content Validity 

When a test is used to determine a person's current knowledge or perfor- 
mance in a domain represented by the test, evidence of how well the test actually 
represents the domain is required to establish the content validity of the test. 
Such evidence usually takes the form of an analysis of the domain into subdivi- 
sions, description of the subdivisions, and identification of the items related 
to each subdivision. In educational achievement tests such subdivisions usually 
correspond to educational objectives. It is important that both subject matter 
content and process be included in the analysis and description of the test. 

Establishment of a test's content validity requires demonstration not only 
of what the test does measure but also of what it does not measure. Extraneous 
factors that are measured by a test but are not conceptually a part of its con- 
tent lower its content validity. Two of the most common such influences are 
reading skill and working speed, because so many achievement tests are composed 
of written items and are given with time limits. 

The careful analysis and description of the measurement domain which
characterize the establishment of content validity distinguish it from "face
validity", which refers to the superficial appearance, or even name, of a test.
Motivation may be better if test items appear to examinees to be relevant to
the purposes of testing; therefore, face validity may be desirable, but it 
is not the same as content validity. 

Criterion-related Validity

When a test is used to predict a specific kind of performance other than 
that measured by the test itself, evidence is required that the test scores 
are indeed related to the other, criterion, performance. Such evidence is most 
commonly presented in the form of a coefficient of correlation between test 
and criterion scores. 

Clearly, a test has as many validities as criteria. Thus the median
correlation of MSAT scores with grades of freshmen in Minnesota colleges is .43,
which demonstrates its validity as a measure of scholastic aptitude; but the
coefficients in individual colleges vary from .10 to .76.
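
A validity coefficient of this kind is an ordinary product-moment correlation between test scores and criterion scores. The sketch below shows the computation; the scores and grade averages are hypothetical values made up for illustration and are not MSAT data.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation between paired test and criterion scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long lists with at least 2 pairs")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("one of the variables has no variance")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical aptitude scores and freshman grade averages for eight students.
test_scores = [22, 28, 31, 35, 40, 44, 47, 52]
grade_avgs = [2.1, 1.8, 2.6, 2.4, 3.0, 2.7, 3.4, 3.1]
print(round(pearson_r(test_scores, grade_avgs), 2))
```

With so few cases a coefficient is very unstable, which may be part of the reason coefficients computed college by college spread so widely around their median.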

Adequate evidence of criterion-related validity requires not only a
validity coefficient of sufficient size to be useful but also a criterion measure
that truly represents the behavior or performance to be predicted. School
marks or grades are the most commonly used educational criteria, and tests
validated against such measures must be used with awareness of the limited scope
of relevant behavior represented in the criterion. Nevertheless, because grades
do represent a significant aspect of achievement and one that may be critical in
determining continuation and completion of an educational program, correlation
of test scores with grades is an important and meaningful indication of validity.

Construct Validity

Criterion-related validity is invaluable for use of a test to aid in
reaching a decision, e.g., choice of college, regarding a specific course of action,
the outcome of which can be measured in some way, e.g., by subsequent course
grades. However, we cannot expect that tests will have been specifically
validated against criteria for all decisions of all students who may be aided by
a better understanding of their capabilities and characteristics as measured 
by tests. For effective counseling use of tests to help understand students 
and to help students understand themselves we must know "what the test measures" 
apart from its prediction of behavior in specific situations. Evidence of the 
meaning of test scores in terms of the psychological characteristics, or con- 
structs, represented by the scores is termed "construct validity". Such 
evidence may take the form of analysis of the content of the test, synthesis of
criterion-related validity coefficients, correlations with other tests, factor 
analysis, differences or similarities of scores of specified groups (e.g., age 
or educational levels), item analysis, observation of test-taking behavior, 
and influence of training or experience on scores. As with evidence of content 
validity, demonstration of what the test does not measure is as important as 
demonstration of what it does measure. 

Interpretation of the Differential Aptitude Tests (DAT) for counseling 
secondary school students, for example, depends largely on construct validity. 
Although the DAT manual reports more than 5,000 predictive validity coefficients,
few counselors will have such evidence available for their students and for
criteria specifically relevant for their students. Focusing on the Mechanical
Reasoning (MR) test we find by examining the items that they deal with gears,
levers, pulleys, the application of forces, and similar principles that are 
part of the content of physical mechanics. The items are presented pictorially, 
with verbal questions about the pictures, so the test requires some reading 
ability; but the questions and the words in them are short and should be easily
understood. Correlations of about .5 to .6 with the Verbal Reasoning test and
with various intelligence tests indicate that MR is measuring something
different from verbal ability, and item analyses of the very similar Mechanical
Reasoning Test indicate that it is measuring a general mechanical ability, not 
separate "levers ability", "gears ability", etc. (Cronbach, 1970). On the 
average MR correlates higher with high school grades in science than in other 
subjects (although it is not the best DAT predictor of science grades), and it 
was found to be an effective predictor of vocational school performance of 
machine shop students but not of auto mechanics students. Girls' scores on 
the test tend to be substantially lower and less reliable than boys' scores 
and to have higher correlations with grades in "unrelated" high school courses 
such as English and social science, suggesting that the test functions somewhat 
differently for the two sexes. Because MR is a revision of earlier Mechanical 
Comprehension Tests, evidence that scores on the latter are related to evalu- 
ations of training and job performance of various jobs concerned with machinery 
supports the construct validity of MR. Finally, MR scores are correlated about 
.4 with mechanical and scientific interests of boys as measured by the Kuder
Preference Record and negligibly with other interests. Again, the relationships 
for girls are lower. Taken together the evidence briefly summarized above sup- 
ports the notion that MR measures a meaningful characteristic of students, one 
that is appropriately labeled "mechanical reasoning", is not the same as general 
intelligence, and is important in certain scientific and mechanical pursuits 
though not in every activity labeled "mechanical". 

Establishment of construct validity in a different domain is illustrated
by the development of the Academic Achievement (AACH) scale for the Strong
Vocational Interest Blank (Campbell and Johansson, 1966). This scale was developed
by selecting SVIB items that significantly differentiated between students with
high grades in college and high school and those with low grades. The scale
correlated about .35 with high school and college grade averages in a cross- 
validation sample drawn from the same population as that on which the scale 
was constructed and also in a 25-year-old sample of college freshmen tested 
in the 1930's. Low correlations with MSAT scores show that the scale is not 
just another measure of scholastic aptitude, and the AACH score adds slightly 
to the multiple correlation of HSR and MSAT with college GPA. In 10-year and 
25-year follow-up groups the scale showed substantial differences between
students who dropped out of college, and, in order, those who obtained BA, MA,
and PhD degrees. Scores were found to increase until about age 28 and then
remain relatively stable. Examination of the item content indicates that items 
scored positively represent scientific, aesthetic, and intellectual activities, 
whereas those scored negatively involve sales, business, and manual skills. 

AACH scores of occupational groups are ranked very much like the average educa- 
tional levels of the groups, with scientists (biologists, mathematicians, 
psychiatrists, physicists) at the top and policemen, forest service men, pilots,
and office workers at the bottom. Scores of outstanding persons in 10
occupations showed similar differences, with outstanding composers, novelists,
astronauts, and psychologists scoring high and outstanding life insurance
salesmen, military men, and football coaches scoring low. In summary the AACH scale
appears to measure interest in activities that lead to getting good grades and 
continuing in school, but it is not a measure of scholastic aptitude as such 
nor a predictor of success within occupations. 

Norm-Referenced Scores 

A norm-referenced score indicates an individual's standing in comparison
with a standard reference group of persons who have taken the same test. In
the interpretation of norm-referenced scores both the nature of the score
transformation and the nature of the reference group must be considered. 

Score Transformations 

The most commonly used norm-referenced scores are percentiles, standard
scores, and grade equivalents.

Percentiles. Percentile scores indicate relative standing in a group in
very much the way rank ordering does, and they are often called percentile
ranks. Because the meaning of a given rank depends on the size of the group
ranked, percentiles adjust for group size by, in effect, indicating the
equivalent of rank order in a standard group of 100 scores. The concept of rank
order and the analogy of "a ladder with 100 rungs" are easy to understand, and
percentiles are much used because of the ease with which their meaning can be
communicated. The most likely misunderstanding of percentiles is an interpre- 
tation of them as indicating "percent correct", and in reporting test results 
to students and parents it is important to insure that this interpretation is 
not made. 

A distribution of percentile scores from a group comparable to that on
which the percentile norms are based will be rectangular, that is, will have
about the same number of cases at each score. There will be, for example,
about the same number of scores at the 98th percentile as at the 50th. Because
there are far more cases near the middle of the raw score distribution than
near the extremes, a small raw score change results in a much larger percentile 
change near the middle than near the extremes. This tendency to accentuate 
differences among mid-range scores and de-emphasize differences among extreme 
scores is a major disadvantage of percentiles. 
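
The unequal spacing of percentiles can be seen in a small sketch. It uses one common convention for a percentile rank (percent of the norm group below a score plus half the percent at that score), which is an assumption here, and a hypothetical bell-shaped norm group on a 20-point test.

```python
import math

def percentile_rank(score, freq):
    """Percent of the norm group below `score` plus half of those at it.

    freq maps each raw score to the number of norm-group members earning it.
    This mid-point convention is one common definition, assumed here.
    """
    total = sum(freq.values())
    if total == 0:
        raise ValueError("norm group is empty")
    below = sum(count for s, count in freq.items() if s < score)
    at = freq.get(score, 0)
    return 100.0 * (below + 0.5 * at) / total

# Hypothetical norm group: symmetric, bell-shaped frequencies on scores 0..20.
norm_freq = {s: math.comb(20, s) for s in range(21)}

# The same one-point raw score gain, near the middle and near the top.
mid_change = percentile_rank(11, norm_freq) - percentile_rank(10, norm_freq)
tail_change = percentile_rank(18, norm_freq) - percentile_rank(17, norm_freq)
print(f"{mid_change:.1f}")   # roughly 17 percentile points
print(f"{tail_change:.2f}")  # well under 1 percentile point
```

In this made-up group a single raw score point is worth many percentile points near the median and almost none near the top of the distribution.
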
Standard scores. This disadvantage is avoided by standard scores, in
which differences are proportional to raw score differences. Standard scores
are anchored at the mean of the norm group distribution, with units
proportional to the standard deviation of the norm group distribution. The basic
standard score transformation (z-score) is made by subtracting the mean from
each score and dividing the remainder by the standard deviation, producing a
score with mean of zero and standard deviation of 1. Because the fractional
and negative scores produced by the z-score transformation are inconvenient,
transformations that assign more units to the standard deviation and a
positive score to the mean are usually used for score interpretation. Some
standard score transformations commonly encountered are: 



Score        Mean    S.D.    Relation to z
Stanine         5       2    2z + 5
ITED, ACT      15       5    5z + 15
T-Score        50      10    10z + 50
GATB          100      20    20z + 100
CEEB          500     100    100z + 500



Because standard score differences are proportional to raw score differences,
comparisons of scores in different parts of the distribution are less subject
to misinterpretation than comparisons of percentiles; and standard scores can
be manipulated mathematically to obtain meaningful averages, correlations, etc.
The meaning of a standard score is not immediately clear, however, without
some understanding of its relation to a normal distribution of scores. Fig. 2
pictures this relationship for several standard score scales as well as for
percentiles.

Figure 2 about here
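
The standard score scales listed above are all linear functions of z, so one small routine covers them. The sketch below uses the means and standard deviations given in the text; the dictionary and function names are illustrative, and the percentile conversion holds only under the assumption that norm-group scores are normally distributed.

```python
import math

# (mean, standard deviation) of each scale, as listed in the text.
SCALES = {
    "Stanine": (5, 2),
    "ITED/ACT": (15, 5),
    "T-Score": (50, 10),
    "GATB": (100, 20),
    "CEEB": (500, 100),
}

def z_score(raw, norm_mean, norm_sd):
    """Basic standard score: distance from the norm-group mean in S.D. units."""
    if norm_sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (raw - norm_mean) / norm_sd

def to_scale(z, scale):
    """Linear conversion sd * z + mean. Reported stanines are additionally
    rounded to whole numbers and limited to 1..9, which is omitted here."""
    mean, sd = SCALES[scale]
    return sd * z + mean

def percentile_if_normal(z):
    """Percentile of z, ASSUMING a normal distribution of norm-group scores."""
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

# Hypothetical raw score of 46 in a norm group with mean 32 and S.D. 14.
z = z_score(46, 32, 14)
for name in SCALES:
    print(name, to_scale(z, name))
print(round(percentile_if_normal(z)))  # about the 84th percentile
```

The same routine also shows why a bare number is ambiguous: a reported 75 would sit 2.5 standard deviations above the mean on the T-Score scale but 1.25 below it on the GATB scale.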

Grade equivalents. Whereas a percentile or a standard score indicates
the location of a score within one specified norm group distribution, a grade- 
equivalent score identifies a specific score distribution for which the obtained 
score is the median. The score distribution is for students at a particular 
grade level. For example, if the grade equivalent for a raw score of 38 is 4.0, 
38 would be the median score of the norm group of beginning 4th-graders. Deci- 
mal parts are added to represent fractions of a 10-month school year, so that 
a grade equivalent of 4.2, for example, represents the median of students tested 
at the end of the second month of the 4th grade. Although there is a hypothe- 
tical norm group for each separate grade equivalent, in practice only a few 
levels are tested within the range of grade equivalents reported. A junior high 
school achievement test might be normed on students tested in the middle of the 
seventh (7.5), eighth (8.5) and ninth (9.5) grades, for example. Intermediate
grade equivalents are determined by interpolation, and equivalents below the 
lowest group tested and above the highest group tested are determined by extra- 
polation. 
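
The interpolation and extrapolation just described can be sketched as below. The median raw scores for the three tested groups are hypothetical, and straight-line interpolation is an assumption made for illustration; a publisher may fit a smoothed curve instead.

```python
def grade_equivalent(raw, anchors):
    """Grade equivalent of a raw score from the tested norm groups.

    anchors: (median_raw_score, grade_placement) pairs for the groups actually
    tested. Values between anchors are linearly interpolated; values beyond
    them are extrapolated from the nearest pair of anchors.
    """
    pts = sorted(anchors)
    if len(pts) < 2:
        raise ValueError("need at least two tested groups")
    if any(a[0] >= b[0] for a, b in zip(pts, pts[1:])):
        raise ValueError("median raw scores must increase with grade")
    if raw <= pts[0][0]:
        (x0, g0), (x1, g1) = pts[0], pts[1]      # extrapolate below lowest group
    elif raw >= pts[-1][0]:
        (x0, g0), (x1, g1) = pts[-2], pts[-1]    # extrapolate above highest group
    else:
        for (x0, g0), (x1, g1) in zip(pts, pts[1:]):
            if x0 <= raw <= x1:
                break
    return g0 + (raw - x0) * (g1 - g0) / (x1 - x0)

# Hypothetical medians for groups tested at grade placements 7.5, 8.5, and 9.5.
anchors = [(30, 7.5), (38, 8.5), (44, 9.5)]
print(grade_equivalent(34, anchors))  # 8.0, interpolated
print(grade_equivalent(47, anchors))  # 10.0, extrapolated beyond any tested group
```

The extrapolated value of 10.0 rests on no 10th-grade students at all, which is one reason extreme grade equivalents call for extra caution.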

Because grade equivalents are especially convenient for measuring progress, 
and because the significance of the score that is "built in" in the form of 
reference to educational levels seems especially easy to understand, grade equi- 
valents are widely used. They have some disadvantages, however, that should 
cause users to interpret them with special caution. Although the meaning of a 
grade equivalent of 6.6 for a student in the middle of the 6th grade is clear, 
the meaning of the same score for a student in the middle of the 4th grade is 
less clear because we have no guidance as to whether such a deviation from the 
"expected" score is rare and significant or common. Certainly the two scores
represent different kinds of achievement and have quite different meanings for 
the two students. Because students do not progress at the same rate in differ- 
ent subjects nor in the same subject at different levels, comparisons across 
subjects are difficult to interpret. At the high school level, where students 
are not taught every subject every year, grade equivalents have largely been 
abandoned for this reason. Finally, grade equivalents seem especially likely 
to be misinterpreted as performance standards. It seems easier to accept the 
notion that, on the average, half the students in the class must be below the 
50th percentile than that half must be below "grade level". 

Perhaps the simplest source of misunderstanding of a test score to be 
guarded against is confusion among the concepts underlying the various score 
transformations. A score of 75, for example, might be a grade equivalent with
the decimal point omitted (common practice), a percentile rank, a standard score
with a mean of 50, or a standard score with a mean of 100. Knowledge and
understanding of the specific transformation is obviously essential to correct
interpretation of the score.

Norm Groups 

Because the meaning carried by norm-referenced scores is relative stand- 
ing in a defined reference group, the characteristics of the reference group 
are most important. 

Size. The group must have adequate size to provide stable results. If
the norm group is a sample from a large population, it must be large enough 
so that variations due to sampling are minimized. Even when the norm group 
can be regarded as the entire population, as, for example, with school or 
class norms, anomalous and possibly misleading norms may be obtained if the 
group is very small.

Representativeness. Adequate size does not insure that a norm group will
be adequately representative of the population specified. Norm groups are fre- 
quently difficult to obtain, and it is rare that samples can be randomly selected. 
The factors that do influence selection are likely to cause the norm group to 
be unrepresentative in unknown ways. Despite the care and expense applied to 
the development of national norms for standardized achievement batteries, the 
norms for different batteries are likely to be quite different. State norms may 
be easier to develop and more meaningful, but unless testing programs are man- 
dated by the state, variations in testing practices among schools will make the 
development of representative norms difficult. "User norms", which are based 
on all the students from a defined population who happen to have taken the test, 
should be especially suspect. 

Currency. Norms must be representative not only at the time they are
developed but also at the time they are used. Norms that are not current may 
be misleading because they do not reflect educational and occupational changes. 

Appropriateness. Given technical soundness in the form of adequate size,
representativeness, and currency of a norm group, it is also important to con- 
sider the appropriateness of a norm group both for the student and for the 
decisions to be made. The student may be currently a member of the populations 
represented by some norms, so their appropriateness for the student is assured. 

A 9th-grade student who has taken the Lorge-Thorndike Intelligence Test (LTIT)
and the Iowa Test of Educational Development (ITED) is a member of the popula- 
tions represented by local school, state, and national norms for each test, all 
of which are appropriate for him. For decisions about his educational experi- 
ences in the immediate future, the local norms would be most appropriate because 
they indicate how he compares with his classmates in various areas. For
longer-range planning the state norms, because they represent the students with whom
he would most likely be compared in other high schools or post-high school 
institutions, would be more helpful. National as well as state norms might be 
used in evaluating how well the school's educational program achieved in var- 
ious domains the kind of educational development expected for students with 
ability levels like those in the school. 

For example, Alice's LTIT Verbal and Non-verbal scores of 59 and 52 put
her at the 73rd percentile according to 9th-grade state norms, indicating an
above-average student. On local norms for her school, however, these scores 
are at the 99th and 93rd percentiles, respectively, which suggest that she is 
likely to move much more rapidly than most of her classmates and may require
special material to enable her to apply her abilities appropriately. In another 
school Brian's 9th-grade LTIT scores of 60 and 51 give him local percentile
scores of 45 and 45, indicating an average student who should progress with the
rest of the class. His percentiles of 75 and 70 on state norms, however, show 
above average ability, suggesting that his educational program should be one 
that will support many possible post-high school options. 
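
The dependence of a percentile on the norm group can be sketched in a few lines of Python. This is only an illustration: the two norm samples below are invented, and `percentile_rank` is a hypothetical helper that uses the definition given later in this paper (the percentage of norm-group scores lower than the student's).

```python
from bisect import bisect_left

def percentile_rank(score, norm_scores):
    """Percentage of scores in the norm group that fall below `score`."""
    ordered = sorted(norm_scores)
    below = bisect_left(ordered, score)  # count of scores strictly lower
    return 100.0 * below / len(ordered)

# Hypothetical norm samples, invented for illustration only.
state_norms = [30, 35, 40, 42, 45, 48, 50, 52, 55, 58,
               60, 62, 65, 70, 75, 80, 85, 90, 95, 100]
local_norms = [25, 28, 30, 32, 35, 36, 38, 40, 41, 43,
               44, 45, 47, 48, 50, 52, 54, 55, 57, 62]

score = 59
state_pr = percentile_rank(score, state_norms)  # middle of the state group
local_pr = percentile_rank(score, local_norms)  # near the top of the local group
print(f"state percentile: {state_pr:.0f}, local percentile: {local_pr:.0f}")
```

The same raw score is average in one group and very high in the other, which is the situation described for Alice and Brian.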

Some norms represent populations of which the student is only potentially, 
not currently, a member. The MSAT norms in Table 1 are of both types. Each 
student who takes the test is clearly a member of the high school junior norm
group, but only potentially a Minnesota college freshman. Similarly, technical
school norms for scores on the General Aptitude Test Battery (GATB) and Minne-
sota Vocational Interest Inventory (MVII) (Pucel and Nelson, 1970a, 1970b)
represent applicants who successfully completed various training programs. Such 
norms indicate not only relative standing in the norm population, but also whether 
it is reasonable to consider the student as a member of the population in the
first place. According to Table 1 Cathy's MSAT score of 32 is average (53rd
percentile) among high school juniors and also among Minnesota junior college
freshmen (51st percentile), somewhat below average among state college freshmen
(35th percentile), and substantially below average among liberal arts college 
freshmen (11th percentile). Nevertheless, Cathy clearly is a potential member 
of any of these groups, and it is reasonable to explore additional information 
about all three types of college. Douglas' MSAT score of 20, however, giving 
him a liberal arts college percentile of 1, indicates not only that Douglas' 
chances of successful performance in most Minnesota liberal arts colleges are 
quite low but also that his more specific estimates of performance in such 
colleges (see "Criterion-Referenced Scores") may not be applicable to Douglas 
because he is quite unlike the populations on which they are based. He is, 
however, a potential member of the junior college population (12th percentile), 
and performance estimates based on this group would be meaningful. It is impor-
tant to note that, although members of such norm groups are identified after
they become members of the defined population, their status at the time they 
were tested was the same as that of the students to whom the norms are applied. 
Thus the Minnesota college freshmen norm groups were tested as high school juni-
ors, and the vocational program graduates were tested as applicants for the
programs. Some norms, such as those often reported for employees in various
occupations, are based on groups of persons already in the defined population 
at the time they are tested. In applying such norms to persons who are only 
potential members of the norm population, the influence on the test results of 
status at time of testing must be considered. 

Multi-Score Tests

Profiles. Although the principles of test interpretation apply whether
there is a single score or several, additional considerations are involved in
tests or test batteries that produce multiple scores. Such scores are com-
monly presented on profiles, which offer a convenient means of displaying
several items of information. A test profile is simply a graphic representa- 
tion of several scores on comparable scales. Fig. 3 is an example of one such 
profile, showing Edwin’s percentile scores on the DAT plotted as vertical bars 
extending above or below the midpoint of the score range for each test. Pro- 
files are often prepared also with adjacent scores connected to each other, 
rather than to the midpoint of the scales, with straight lines, as in Fig. 4. 
The key word in the definition of a test profile is "comparable". It is inap- 
propriate to profile raw scores because there is no basis for comparing raw 
scores on one test with those on another. The raw scores must be transformed 
to scales with comparable units, such as percentiles or standard scores. Fur- 
thermore, the transformations for all tests must be based on the same norm 
group. The provision of such comparability was an important objective and is 
now a basic feature of standardized batteries of aptitude and achievement tests. 

Figure 3 about here 



Difference scores. Because profiles do make score comparison easy, it is
important to guard against over-interpretation of the differences that appear.
The concept of error of measurement is especially important in evaluating dif-
ferences in scores because the measurement errors cumulate, making the differences
less reliable than the separate scores. In psychometric terms the standard error
of the difference, S.E._D, is given by

$$\mathrm{S.E.}_D = \sqrt{\mathrm{S.E.}_1^2 + \mathrm{S.E.}_2^2} \qquad (5)$$

where S.E._1 and S.E._2 are the standard errors of the two tests whose scores are
being compared. If the two standard errors are equal, formula (5) indicates
that S.E._D is about 1.4 times the standard errors of the individual tests.
Computation of S.E._D is cumbersome, and test publishers commonly offer
convenient guides to the significance of score differences. When scores
are reported as percentile bands, as on School and College Ability Tests
and Sequential Tests of Educational Progress, bands that do not overlap are 
regarded as representing reliably different true scores. The manual for the 
High School Stanford Achievement Test (SAT) includes a table of standard errors 
of difference for each pair of tests in the battery, which should be consulted 
in evaluating SAT profiles. The reported S.E._D of 5 for Spelling and Numerical
Competence, for example, indicates that only one-third of the time would dif- 
ferences as large as 5 be obtained if the true scores for these abilities are 
equal, and only 5% of the time would differences as large as 10 be obtained. 
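
The arithmetic behind these statements can be checked with a short Python sketch. It assumes normally distributed errors of measurement, and the function names are hypothetical, introduced only for this illustration.

```python
import math

def se_difference(se_a, se_b):
    """Standard error of the difference between two scores (formula 5)."""
    return math.sqrt(se_a ** 2 + se_b ** 2)

def chance_of_difference(difference, se_diff):
    """Probability of an observed difference at least this large (in either
    direction) when the true scores are equal, assuming normal errors."""
    z = abs(difference) / se_diff
    return math.erfc(z / math.sqrt(2))

# Equal standard errors: the difference is about 1.4 times as unreliable.
ratio = se_difference(1.0, 1.0)

# With a reported standard error of difference of 5:
p_5 = chance_of_difference(5, 5)    # about one-third
p_10 = chance_of_difference(10, 5)  # about 5%
print(f"ratio={ratio:.2f}, P(|d|>=5)={p_5:.3f}, P(|d|>=10)={p_10:.3f}")
```

A difference equal to one standard error of difference arises by chance about a third of the time, and one twice as large about one time in twenty, which matches the figures quoted for Spelling and Numerical Competence.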

Nearly all of the SAT S.E._D's range from 4 to 6, although a few are as
small as 3. Standardized tests used for individual student diagnosis and
guidance should generally have reliabilities close to .9, which will provide
S.E._D's of about half a standard deviation (5 points on the SAT standard score
scale). The profile for the DAT is printed with 1 inch = 1 S.D. = 2 S.E._D (approx-
imately), so that differences of one inch or more correspond to a critical ratio
of 2 (5 percent significance level) and may be regarded as significantly differ-
ent. (It is suggested that differences of one-half inch be interpreted if
confirmed by other evidence.) Comparison of Edwin's DAT scores in Fig. 3 with
the 50th percentile reference line indicates that his scores are generally low,
only the score on Mechanical Reasoning reaching the average level. Of the
individual scores, Mechanical Reasoning is significantly different from all
except perhaps Space Relations; whereas the others, despite their apparent dif-
ferences, are sufficiently similar that differences among them should not be
emphasized.

Profile applications. Profiles conveniently display both the overall
level of a student's scores and areas of strength or weakness. Thus Frank's
11th-grade ITED scores in Fig. 4 show generally superior performance, with
special strength in mathematics and some weakness in English expression, lit-
erature, and vocabulary. The scores provide a basis for discussion with Frank
of his high school program for the remainder of his junior and senior year and
of his post-high school plans. The counselor may wish to suggest that Frank
concentrate on improving his communication skills in preparation for college 
work. Fig. 4 illustrates another use of profiles, namely for examining change.
Frank's performance is very consistent from the 9th to the 11th grade, except
for a fairly sizable improvement in his social studies score. This change may 
reflect an unusual course sequence in Frank's case, or perhaps the development 
of new interests. 

Figure 4 about here

A test profile is a convenient way to summarize group as well as individual 
test performance. Overall performance of a school or class can be evaluated in 
comparison with the norm-group average, and strengths and weaknesses can be noted 
in the same way as with individual scores. Similarly the scores of the same 
group at two different times or of two different groups at the same time can be 
plotted on one profile to facilitate group comparisons and reveal changes. Spe-
cial care must be taken in evaluating the magnitude of group differences in terms
of score scales based on individuals, because the mean scores of groups are much
less variable than individual scores. Whereas an individual percentile score
of 60 differs rather inconsequentially from the midpoint of the norm group, a
group mean at the 60th percentile is likely to be extremely high in comparison
with other groups. Precise interpretation of such differences requires norms
of group means. 
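
A rough Python sketch shows how quickly this effect grows with group size. It rests on two simplifying assumptions that are not claims about any real norm group: scores are normally distributed, and each group is a random sample of n students from the norm population, so that group means vary with a standard deviation of S.D./sqrt(n). Real schools are not random samples and their means vary more than this, which is why norms of group means are needed for precise interpretation. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def group_mean_percentile(individual_percentile, n):
    """Standing among means of random groups of size n for a group whose
    mean falls at `individual_percentile` on the individual-score scale."""
    unit_normal = NormalDist()
    z_individual = unit_normal.inv_cdf(individual_percentile / 100.0)
    z_group = z_individual * math.sqrt(n)  # group means have S.D. / sqrt(n)
    return 100.0 * unit_normal.cdf(z_group)

for n in (1, 25, 100):
    print(n, round(group_mean_percentile(60, n), 1))
```

Under these assumptions a mean at the 60th individual percentile stands far above most other group means once the group reaches classroom size.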

To learn more about the nature of group differences revealed by the profile
it may be helpful to examine the distributions of scores for individual tests. 
Fig. 5 shows 9th-grade percentile scores for the state norm group and the local 
percentiles for one school plotted against raw scores on the SAT-HS English Test. 
(Either percentile scores or cumulative percentages can be used, but both groups 
must be represented in the same way.) The school's average score is somewhat 
below the state mean, but the graph shows that this difference appears almost 
entirely in the lower part of the score distribution. This evidence does not 
explain the lower mean score, of course. One possibility is that the curriculum 
or the instruction is such that insufficient attention has been given to the less 
able students. An equally tenable hypothesis is that the English achievement 
scores reflect a similar distribution of learning ability of the students in the 
school. This hypothesis could be checked by examining scores of the same stu- 
dents on a general intelligence test such as LTIT in comparison with state norms. 

Figure 5 about here 

Similarity indexes . We sometimes wish to compare a student's scores with 
each of several reference groups. This may be done either by transforming the 
student's scores into standard scores or percentiles based on each reference 
group in turn, or by displaying the reference group distributions as well as 
the student's performance in terms of a single norm. Vocational training pro-
gram norms for the GATB and MVII (Pucel & Nelson, 1970a, 1970b) are of the
latter type. As a student's scores are compared with each of several groups
and similarities and differences are noted, questions of how different the
student is from a given group, or which group he is most like, arise; and the
multiple comparisons produce more information than even profiles can conveni- 
ently summarize. To summarize such comparisons and obtain answers to questions 
like those above, indexes of profile similarity are used. One such index is 
the centour score, developed by Rulon, Tiedeman, Tatsuoka, and Langmuir (1967).
Centour scores are like the scores on a target, where the bullseye, or the 
center (not the top) of the reference group, gets a score of 100, and the rings 
successively further in any direction from the center get successively lower 
scores. A centour score of zero, like missing the target completely, corres- 
ponds to a set of test scores outside of the "test space" occupied by any score 
in the reference group. (In actual use centour scores are usually based on 
more than two test scores, and therefore more than two dimensions, and take 
into account not just differences in individual scores but also in score com- 
binations. Consequently the "target" is elliptical rather than round, and 
multi-dimensional rather than flat.) Just as a student's percentile gives 
the percentage of scores in the norm group lower than his, the centour score 
gives the percentage of score combinations in the norm group "further out" 
than his. Like all summaries, centour scores both reveal and conceal infor- 
mation. A student's centour scores reveal his similarity simultaneously to 
a large number of reference groups in which he may be interested. At the 
same time they conceal the specific ways in which he is similar to and differ- 
ent from each of them. Centour scores of 50 for three different groups may 
result from a student's having all higher scores than the average for one 
group, all lower scores than the average for another, and some higher and
some lower than the average for the third. The differences are important,
and to discover them we must go back to each profile and consider it in detail. 
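
The target analogy can be made concrete with a small Python sketch for the two-test case. It assumes the reference group's scores follow a bivariate normal distribution, so that the proportion of score combinations "further out" than a given point depends only on its squared generalized distance from the group center (a chi-square quantity with two degrees of freedom). The function name and the group statistics are invented for illustration.

```python
import math

def centour_score(scores, means, sds, r):
    """Centour score for two tests, assuming a bivariate normal reference
    group with the given means, standard deviations, and correlation r."""
    z1 = (scores[0] - means[0]) / sds[0]
    z2 = (scores[1] - means[1]) / sds[1]
    # Squared generalized (elliptical) distance from the group center.
    d2 = (z1 * z1 - 2.0 * r * z1 * z2 + z2 * z2) / (1.0 - r * r)
    # Percentage of the group lying further out (chi-square, 2 df).
    return 100.0 * math.exp(-d2 / 2.0)

means, sds, r = (50, 50), (10, 10), 0.5   # hypothetical reference group
center = centour_score((50, 50), means, sds, r)
both_high = centour_score((60, 60), means, sds, r)
mixed = centour_score((60, 40), means, sds, r)
print(round(center, 1), round(both_high, 1), round(mixed, 1))
```

Both students are one standard deviation from the mean on each test, yet the one whose scores run against the group's usual combination (one high, one low) receives a much lower centour score. That is the sense in which the target is elliptical rather than round.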

For example, Table 2 gives the centour score representations of seven
GATB aptitude scores for five students with respect to 18 vocational training 
groups studied by Pucel and Nelson (1970a). Greg's centour scores show little 
similarity to any of the vocational training programs. Examination of his 
aptitude scores indicates that they are all lower, some of them substantially 
lower, than average for students in these programs. These are not the only 
training programs available, of course, nor do these tests measure all impor- 
tant abilities. It will be necessary for the counselor to explore with Greg 
his possible strengths in other areas and the ways in which these strengths 
match possible training or job opportunities. 

Table 2 about here 

Helen's scores, like Greg's, are dissimilar to those of graduates of all 
18 programs, but the reason is quite different in her case. Most of her apti- 
tude scores are quite high in comparison with the vocational school population. 
Helen may want to start with a more academic program, perhaps in a junior col- 
lege, where she would have an opportunity more gradually to narrow her focus 
on a career program or a college transfer curriculum. 

Although none of Irene's centour scores is high, she does have several--
Agri-technology, Clerical training, Cosmetology--that suggest a careful look
at these fields. Her weakest ability, according to the aptitude scores, is
in working with numbers (which also influences the G score). Neither the cen-
tour scores nor the aptitude scores provide any information about the relative
importance of this weakness for various occupations, but both the "construct
validity" of numerical ability and the lower mean N score of the Cosmetology
students suggest that it may be less significant in the Cosmetology program 
than in either of the other two. 

In contrast to the other students, Jerry's scores fall in the area where 
all the training groups overlap. As a result, all of his centour scores are
high, including several that are very high. Although the high centour scores 
provide some guidance, Jerry's ability pattern fits well into all the training 
groups, and other considerations than his abilities will likely determine his 
choice. 

The pattern of Karen's scores is similar to Irene's, but all of her apti-
tude scores are higher, and this difference is reflected in higher centour
scores in more areas. In addition to clerical and cosmetology training, prac- 
tical nursing and secretarial training offer good possibilities. 

It is important to note that similarity indexes, like all norm-referenced
scores, do not in themselves indicate the likelihood of behavior of any kind 
other than that required by the tests themselves. To predict from the test 
scores to behavior in other situations we must rely on information about test 
validity, which is not introduced or represented by the norming process. 

Interest profiles. Interest profiles are a special case of score represen- 
tation by profile. Because of the way occupational scales are constructed, the 
practice has developed of norming each scale on its own occupational group, 
rather than on a single standard reference group for all scales. On the SVIB
and MVII the scores are standard scores with an occupational group mean of 50
and S.D. of 10; on the Kuder Occupational Interest Survey the scores are, in
effect, correlations between the students' responses and those of each refer-
ence group. Such profiles must be interpreted somewhat differently from those
based on a single norm group. To provide a comparable reference point the SVIB
and MVII profiles show the mid-third range of scores for a standard men-in-
general group on each scale. These considerations do not apply to the Basic
Scales of the SVIB or the Homogeneous Scales of the MVII, which in each case
are all normed on a single reference group. 

Criterion-Referenced Scores

Whereas norm-referencing procedures provide meaning to test scores in 
terms of relative standing in a defined group of persons, criterion-referencing 
provides meaning in terms of expected behavior. The behavior may be defined
by the test content itself, in which case we have content scores, or by a sep-
arate (criterion) measure, in which case we have predicted scores.

Content Scores 

Scores on a content-referenced scale are summaries of the behavior on the 
test. Rate scores (e.g. reading rate, typing speed) and percentage scores are 
commonly used to represent performance, but to have meaning such scores must 
be accompanied by definitions of the content itself. Thus we have a "reading 
rate of 247 wpm on passages from The Readers' Digest", or "83 percent accuracy
on 2-digit by 2-digit multiplication problems". If brief descriptions do not 
suffice to define the content, samples or examples may be used, such as "ability 
to spell 77 percent of words such as ambitious, anticipate, disappoint, eligible, 
indefinite, liability, miniature, oblige, sympathy, treasurer". To be most use- 
ful the content referred to should be not just described but scaled, so that 
mastery of a specified level implies mastery of all easier levels. Such scaling 
is just beginning in some fields, and few standardized instruments are available 
that reflect it. A fundamental requirement for the use of content-referenced
scores, of course, is satisfactory content validity. 

Predicted Scores 

If criterion-related validity has been demonstrated, the validity rela-
tionship can be used to report test performance directly in terms of expected
criterion behavior. This is usually done in the form of either criterion
estimates or expectancy tables or graphs. 

Criterion estimates . Given a linear relationship between a test score 
(or scores) and a criterion variable, as reflected by a significant validity 
coefficient, an individual's expected score on the criterion variable can be 
predicted by the corresponding regression equation. From the correlation of 
.60 between a college aptitude index (I) and first-term freshman grades (GPA)
in one university, for example, we obtain the following equation for predicting 
GPA from I : 

$$\mathrm{GPA} = .74 + .02\,I \qquad (6)$$

From this equation we learn that the predicted GPA corresponding to the min-
imum acceptable index of 40 is 1.54.

Like any test scores predicted scores are accompanied by uncertainty. In 
the case of predicted scores, however, this uncertainty is caused not only by 
the error of measurement of the test score, but also by measurement error in 
the criterion and by lack of perfect correlation between the true scores of the 
two measures. The combination of these three sources of error usually results 
in considerable imprecision in prediction, and it is important that this uncer- 
tainty be recognized in interpreting predicted scores. It is usually expressed 
as the standard error of estimate, computed as 

$$\mathrm{S.E.}_{est} = S_c \sqrt{1 - r^2}$$

where r is the validity coefficient and S_c is the criterion standard deviation,
and interpreted as the standard deviation of observed criterion scores around 
each predicted score. Fig. 6 portrays the standard error of estimate in rela-
tion to the standard deviation of criterion scores.

In the case of the regression equation discussed above the standard
error of estimate is computed from the validity coefficient and the criterion
S.D. to be .60. This figure, combined with the predicted GPA obtained above,
indicates that of students with an index of 40 two-thirds will obtain GPA's
between .94 and 2.14 and 95% will obtain GPA's between .34 and 2.74. The
importance of taking into account the error of estimate in interpreting pre- 
dicted scores is indicated by the width of the range needed to provide 
considerable assurance that the criterion score will indeed be included in
the predicted range. 
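
The computation just described can be laid out as a brief Python sketch. It takes the intercept as .74, the value consistent with the GPA ranges quoted above, and assumes criterion scores are normally distributed around each prediction. The criterion S.D. of .75 is not stated in the text; it is simply the value implied by a validity coefficient of .60 and a standard error of estimate of .60. Function and constant names are hypothetical.

```python
import math

INTERCEPT = 0.74     # constant in equation (6), as read from the quoted ranges
SLOPE = 0.02         # GPA points per index point
VALIDITY_R = 0.60    # correlation between index and GPA
SE_ESTIMATE = 0.60   # standard error of estimate given in the text

def predicted_gpa(index):
    """Expected GPA for a given aptitude index, from equation (6)."""
    return INTERCEPT + SLOPE * index

def se_estimate(criterion_sd, r):
    """Standard error of estimate from criterion S.D. and validity."""
    return criterion_sd * math.sqrt(1.0 - r * r)

def band(index, k):
    """Range of k standard errors of estimate around the predicted GPA."""
    center = predicted_gpa(index)
    return (center - k * SE_ESTIMATE, center + k * SE_ESTIMATE)

implied_sd = SE_ESTIMATE / math.sqrt(1.0 - VALIDITY_R ** 2)  # assumption: 0.75

print(round(predicted_gpa(40), 2))            # prediction for index 40
print([round(x, 2) for x in band(40, 1)])     # about two-thirds of students
print([round(x, 2) for x in band(40, 2)])     # about 95% of students
```

The two-standard-error band spans 2.4 grade points, most of the grading scale, which is the imprecision the paragraph above warns about.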

Predicted scores are used, of course, not for persons whose criterion 
scores are known, but for a new group of individuals (e.g., applicants) who 
have not been measured on the criterion. The standard error of estimate does 
not take into account sampling error in determining the regression equation. 
Interpretation of a predicted score and its associated estimate of precision 
assumes that the score comes from the same population represented by the sam- 
ple on which the regression equation was determined and that this sample is 
large enough to provide accurate estimates of the regression parameters for 
the population. 



Figure 6 about here 




Expectancy Tables 

Instead of predicting a specific criterion score and accompanying confi- 
dence band corresponding to each test score, a common practice is to report 
the probability of obtaining a criterion score within certain fixed ranges or 
above certain points. The criterion ranges for which probabilities are given 
are the same for all test scores, and the probabilities usually are reported
for test score ranges rather than for individual scores. The expectancy tables
relating high school rank (HSR) and aptitude test score to first-term college
grades given in Tables 3 and 4 are examples of this method of criterion-refer- 
enced score interpretation. These tables were produced by determining the 
proportions of students in each fifth of the predictor distribution who obtained 
a college grade average of C or better and of B or better. Application of the 
tables can be illustrated with the scores of Linda, who has always done above 
average but not outstanding work in school (HSR = 63) and has been developing a
serious interest in art, in which she seems to have some talent. She wants a
"good, general education" and plans to obtain it at the liberal arts college of
the state university, which she can attend while living at home. Her aptitude
test score of 36 is consistent with her high school record (junior percentile =
69), and is sufficient to enter the university (college percentile = 58). Linda's
HSR is in the 60-79 range of the university expectancy table (Table 3) which is
clearly below average for university females (above 12% and below 59%) but
indicates a reasonable probability (67%) of obtaining at least a C average. Her
chances of getting a B average or better are not high (10%). Information pro- 
vided by the aptitude test expectancy table is consistent. Her college percentile, 
in the 40-59 range, is in the lowest quarter of entering university students and 
shows grade probabilities nearly identical to the HSR table. Linda has been 
considering, besides the university, the applied arts program in a state college. 
According to the state college expectancy table (Table 4) Linda's scores are 
below average for entering freshmen here also, but not quite so far below, and 
her chances of getting satisfactory grades are somewhat higher (79% and 80%).
Properly interpreted these data can help Linda understand some differences
between the two colleges, consider the kind of program and level of intellec-
tual challenge most appropriate for her, and stimulate her to seek further
information to help her resolve the choice. 
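
The construction of such a table from the records of a previous class can be sketched in Python. The student records, the band limits, and the function name below are all invented for illustration and do not reproduce Tables 3 and 4; grade averages are assumed to be on a 4-point scale with C = 2.0 and B = 3.0.

```python
def expectancy_table(records, bands, cutoffs):
    """Percentage of students in each predictor band who reached each
    criterion cutoff. `records` holds (predictor, criterion) pairs."""
    table = {}
    for low, high in bands:
        in_band = [crit for pred, crit in records if low <= pred <= high]
        n = len(in_band)
        row = {"n": n}
        for label, cutoff in cutoffs.items():
            reached = sum(1 for crit in in_band if crit >= cutoff)
            row[label] = 100.0 * reached / n if n else None
        table[(low, high)] = row
    return table

# Hypothetical (HSR, first-term GPA) records for one entering class.
records = [(85, 3.2), (90, 2.4), (82, 1.8), (95, 3.6),
           (65, 2.1), (70, 3.0), (61, 1.5), (78, 2.0),
           (45, 1.2), (55, 2.2), (41, 1.9), (50, 0.8)]
bands = [(80, 99), (60, 79), (40, 59)]
cutoffs = {"C or better": 2.0, "B or better": 3.0}

table = expectancy_table(records, bands, cutoffs)
for band_limits, row in table.items():
    print(band_limits, row)
```

Each row reports the number of cases alongside the percentages, because the stability of a row depends on how many students it is based on.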

Tables 3 & 4 about here

In comparison with criterion estimates based on regression equations, 
expectancy tables do not require a normal bivariate distribution underlying 
their interpretation, and they avoid an unwarranted appearance of precision. 

The uncertainties associated with measurement error and degree of relationship 
between the variables are reflected by the probability figures themselves. How- 
ever, there are important cautions to be observed in using expectancy tables, 
cautions which reflect the fact that the tabled figures are actually propor-
tions of previous classes rather than probabilities of future performance. (It
has been suggested that they be called experience tables rather than expectancy
tables.) First, in interpreting the figures as expectancies for new students
we must assume that the composition of the new classes will be the same with
respect to academic ability as the classes on which the tables are based and
that they will be treated the same, i.e., that grading practices will remain 
the same. (Theoretically, it is unnecessary to assume that class composition 
remains the same if absolute marking standards do not change; but, because most 
grading is at least partly relative, it is more realistic to expect that a marked 
change in class composition will change the expectancies.) Entering classes will 
differ somewhat from year to year; but, unless there is a definite change in 
policy, such as an increase in admission standards, the differences are likely 
to be slight enough to maintain the validity of the expectancy tables. Over a 
period of years, however, such changes can cumulate, so the tables must either
be reasonably current or be accompanied by evidence of consistency, such as
predictor and criterion distributions that remain the same from year to year, 
if they are to be relied on. Second, it is important that each table be based 
on a group large enough to provide stable proportions. Like the standard error 
of estimate the expectancies reflect uncertainty due to measurement and predic- 
tion error but not that due to sampling variation. The number of cases in each 
predictor range (i.e., each row of the tables) determines the stability of the 
proportions for that range. It is for this reason that predictors are grouped 
into just five or six categories rather than a larger number that would permit 
more discriminating probability estimates. Because the classes on which the 
percentages are based are obviously not random samples from the schools' popu-
lations of entering students, interpretation of the standard error in terms
of expected variation for future classes is not possible; but it is clear that 
the expectancies based on small N's should be used with extra caution. Finally, 
expectancy tables are necessarily based on the experiences of enrolled students; 
and these students form populations that differ from high school seniors in 
ways varying from one college to another as a result of both college admissions 
policies and practices and students' college selection decisions. To refer a 
student's score to a given expectancy table it must be reasonable to consider 
him a potential member of the population on which the table is based. If the 
table shows no scores in the range containing the student's score, it is clear 
that the table is not applicable to him. Even if a small percentage of the 
class had similar predictor scores, these students were atypical of their class- 
mates with respect to these scores; and, inasmuch as they were enrolled despite 
this atypicality, they are likely to be atypical in unknown ways of students
with similar scores. Thus, not only expectancies based on small N's, but also 
those based on small proportions of the class, should be viewed with caution.

Consider, for example, Michael's HSR of 36. The expectancy table for
the University indicates that Michael's chances of obtaining passing grades
(57%) or a B average or better (11%) are slightly larger than those of boys
with HSR's in the range of 40-59. The first explanation to be considered for
anomalies of this kind in the tables is a small number of cases, but in this
case the N of about 70 (4% of 1981) should be sufficient to avoid fluctuations
of this size merely because of sampling error. As noted above, students who 
enroll in a college despite very low predictor scores are likely to have spe- 
cial strengths in other areas or high scores on other predictors. Unless 
Michael has such strengths he would be unwise to rely too heavily on the 
tabled expectancies. 
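
A rough gauge of how much a tabled proportion can fluctuate with the number of cases is the binomial standard error of a proportion, sketched below in Python. It is offered only as an order-of-magnitude check: as noted earlier, entering classes are not random samples, so the figure cannot be read strictly as expected variation in future classes. The function name is hypothetical, and the comparison N of 10 is invented.

```python
import math

def proportion_se(p, n):
    """Binomial standard error of a proportion p based on n cases."""
    return math.sqrt(p * (1.0 - p) / n)

se_michael = proportion_se(0.57, 70)   # the row containing Michael's HSR
se_small = proportion_se(0.57, 10)     # the same proportion on only 10 cases
print(round(se_michael, 3), round(se_small, 3))
```

With about 70 cases the standard error is roughly 6 percentage points; with 10 cases it would be more than 15.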

When predictions of the same criterion are made from more than one pre- 
dictor, the results will not always agree. Norman is thinking of going to 
the state college, and referral of his aptitude test percentile of 40 to the 
expectancy table indicates that his chances of obtaining passing grades on the 
average are 70%, but according to his HSR of 39 his chances of getting a C 
average are only 30%. Which is correct? Part of the discrepancy may be as- 
cribed to the fact that Norman's scores are at the upper edge of one interval 
and at the lower edge of the other. The coarse grouping results in some inac-
curacy. Thus Norman's chances for a C average are undoubtedly more like those
of a student whose HSR is 40 than like those of a student whose HSR is 20, which
is in Norman's interval with 30% probability.
Some interpolation of probabilities may be made to adjust for this phenomenon, 
but even with such adjustments Norman's two predictions are discrepant. To 
determine which is more valid, Norman should consider with his counselor such 
information as whether special problems or responsibilities, which would not 
affect his college work, have held his high school grades down; whether his
other test scores confirm the ability indicated by the aptitude scores or
suggest that it is singularly high; whether Norman's academic motivation and
study habits have changed in such a way as to give him a better chance of
success in college than his high school grades indicate. 
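
One simple way to carry out the interpolation mentioned above is to treat each tabled percentage as applying at the midpoint of its interval and to interpolate linearly between adjacent midpoints. The Python sketch below does this. The 30% figure for the 20-39 interval is Norman's from the text; the 50% figure for the 40-59 interval is invented for illustration, and the function name is hypothetical.

```python
def interpolate_expectancy(score, bands):
    """Linear interpolation between interval midpoints.
    `bands` is a list of (low, high, percentage), sorted by `low`."""
    mids = [((low + high) / 2.0, pct) for low, high, pct in bands]
    if score <= mids[0][0]:
        return mids[0][1]
    if score >= mids[-1][0]:
        return mids[-1][1]
    for (x0, p0), (x1, p1) in zip(mids, mids[1:]):
        if x0 <= score <= x1:
            return p0 + (score - x0) / (x1 - x0) * (p1 - p0)

bands = [(20, 39, 30.0),   # from the text
         (40, 59, 50.0)]   # hypothetical value for the next interval
adjusted = interpolate_expectancy(39, bands)
print(adjusted)
```

An HSR at the top of its interval is credited with a percentage close to halfway toward the next interval's figure. The adjustment narrows the gap between Norman's two predictions but, as the text notes, does not remove it.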

As the considerations above suggest, the expectancy tables do not in
themselves decide whether or not a student should attend a given college.
The same probability of success that leads one student to choose a college
may lead another to look elsewhere. A 30% chance of success may encourage
one student, whereas a 70% chance may discourage another. Nor should the
tables be used to "shop" for a college by seeking to identify the college in
which the student has the best chance of obtaining good grades. But they do
provide information, suggest additional questions, and supply some answers to
help clarify tentative choices or narrow the field of possibilities.

Discrepancy scores. Expectancy tables may be used not only to help reach
decisions about the future but also to help explain the past. In the latter 
application, comparison of actual performance with expectancies based on pre- 
vious scores may aid a counselor in understanding that performance. Quite 
different explanations of a student's failing grades, and different courses 
of action, may be indicated if his probability of a passing average were, say,
17% than if it were 70%.

Expectancy tables especially intended for this kind of interpretation, 
rather than prediction, of performance are sometimes provided for combinations 
of ability and achievement test scores. The manual for the SAT High School
Battery presents quartile scores for each achievement test based on the
distributions of scores for students in each stanine on the Otis Gamma Mental
Ability Test. Orley's standard score of 57 on the English test puts him well
above average (national norms) for 11th-graders in general, but more than
three-fourths of 11th-grade students with Otis scores in the 8th stanine, as
his is, score higher. This information may lead the teacher or counselor to a
different interpretation of his English score than its percentile equivalent
alone.
Because the interest in expectancy tables of this kind is on the discrepancy 
between the ability and achievement scores, they are discussed here under the 
heading of "discrepancy scores"; but in reality such expectancy tables do not 
give criterion-referenced scores at all. Neither the ability test nor the 
achievement test is a criterion. The ability test, rather, is used to divide 
the norm group into more homogeneous subgroups so that more specific norms 
can be provided. Emphasis on the norm-referenced character of this kind of 
information may help to avoid reification of score differences into concepts 
such as "underachiever" and "overachiever". At the very least it is important 
to be aware of the differences between criterion-referenced and norm-referenced 
expectancy tables. Thorndike (1967) has pointed out a paradox in connection
with the latter, namely that their value depends on the existence of moderate, 
rather than very high or very low, relationships between ability and achieve- 
ment scores. If the relationship is very low, of course, achievement norms 
for low-ability students will not be appreciably different from those for
high-ability students; and subdivision of the norm group will be useless. If the
relationship is extremely high, on the other hand, the tests will be measuring 
much the same thing; and discrepancies between scores on the two instruments 
will be due largely to measurement error and not subject to meaningful inter- 
pretation. For prediction purposes, of course, the higher the relationship 
represented in an expectancy table, the more helpful is the information. 
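
One half of this paradox can be made concrete with the classical test theory formula for the reliability of a difference between two scores. The formula is not given in this discussion; it is a standard result that assumes both scores are expressed on a common standard scale with equal variances. The function name and the reliability values below are chosen purely for illustration.

```python
def difference_reliability(r_xx: float, r_yy: float, r_xy: float) -> float:
    """Reliability of the difference X - Y between two scores.

    r_xx, r_yy: reliabilities of the two tests.
    r_xy:       correlation between the two tests.
    Assumes equal variances (scores on a common standard scale).
    """
    if not -1.0 < r_xy < 1.0:
        raise ValueError("r_xy must lie strictly between -1 and 1")
    return ((r_xx + r_yy) / 2.0 - r_xy) / (1.0 - r_xy)


if __name__ == "__main__":
    # Illustrative values: both tests are given a reliability of .90.
    for r_xy in (0.2, 0.5, 0.8):
        r_dd = difference_reliability(0.90, 0.90, r_xy)
        print(f"r_xy = {r_xy:.1f}  reliability of discrepancy = {r_dd:.2f}")
```

If both tests have reliabilities of .90, the reliability of the discrepancy is .80 when the tests correlate .50 but only .50 when they correlate .80. As the relationship between ability and achievement scores rises, then, more of any observed discrepancy is measurement error. The sketch does not address the other half of the paradox, in which a very low relationship makes subdivision of the norm group useless.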
TABLE 1

Minnesota Scholastic Aptitude Test Norms for
High School Juniors and Entering College Freshmen

1968

             Four-yr.     U of M
Percen-      Lib. Arts    Four-yr     State       Junior      HS
tile                      Coll        Colleges    Colleges    Juniors

  99         68-75        67-75       61-75       60-75       64-75
  98         67           66          59-60       58-59       61-63
  95         65           64          56          53          57
  90         63           61          52          49          52
  80         58           56          45          43          45
  75         56           54          44          41          42
  70         54           52          41          39          39
  60         51           48          39          35          35
  50         47           45          36          32          31
  40         44           42          34          29          27
  30         40           39          31          26          24
  25         39           38          29          25          22
  20         37           36          27          23          20
  10         31           32          24          19          16
   5         27           27          20          16          14
   2         21-23        20-23       16-17       13          11
   1         0-20         0-19        0-15        0-12        0-10
TABLE 2

Aptitude and Centour Scores for Five Students

                              Greg    Helen   Irene   Jerry   Karen

Centours

 1. Aircraft Mechanics          0       0       1      50       1
 2. Agri-Technology             0       0      21      39       7
 3. Automotives                 3       0      12      82      20
 4. Electronics                 0       2       1      86       9
 5. Carpentry                   0       0       0      68       1
 6. Farm Equipment Mech         0       0       2      82       5
 7. Machine Shop                0       0       1      82       5
 8. Mech Drafting               0       0       0      90       4
 9. Power Home Elect            1       0       4      81       7
10. Printing, Graphics                  1       2      82      12
11. Welding                     7       0       6      68      11
12. Accounting                  0       3       6      63      29
13. Clerical                    0       2      25      64      68
14. Cosmetology                 0       3      24      44      71
15. Data Processing             0       3       3      60      27
16. Practical Nursing           0      16      12      68      74
17. Sales                       0       0       4      72      34
18. Secretarial                 0      20      10      48      70

Aptitudes

 1. General                    70     124      78     113     107
 2. Verbal                     78     139      96     100     104
 3. Numerical                  54     117      81     107     107
 4. Spatial                    97     117      94     137     101
 5. Form Perception            84     129     107     111     140
 6. Clerical Perception       100     129     115     118     139
 7. Motor                      82     103     101     111     132
TABLE 3

State University Expectancy Table
for First-Term Grade Average

FEMALES

          High School Rank (N=1971)         Aptitude Test (N=1990)

                   Chances in 100 of a               Chances in 100 of a
                   freshman obtaining an             freshman obtaining an
                   average grade of:                 average grade of:
          % of     C or       B or          % of     C or       B or
%ile      class    Higher     Higher        class    Higher     Higher

90-99      35       92         47            34       90         44
80-89      24       80         18            19       79         24
60-79      29       67         10            25       71         14
40-59      10       56          7            17       65          8
20-39       2       47          9             5       54          3
 1-19               -           -             1       55          9

MALES

          High School Rank (N=1781)         Aptitude Test (N=1812)

                   Chances in 100 of a               Chances in 100 of a
                   freshman obtaining an             freshman obtaining an
                   average grade of:                 average grade of:
          % of     C or       B or          % of     C or       B or
%ile      class    Higher     Higher        class    Higher     Higher

90-99      23       88         45            27       82         39
80-89      22       74         20            18       73         20
60-79      34       62         10            31       64         13
40-59      17       50          7            20       55          8
20-39       4       57         11             5       54          8
 1-19               *           -             1       *           -

* the number of students in this cell is not large enough to produce a
  reliable percentage

- no students in this cell
TABLE 4

State College Expectancy Table
for First-Term Grade Average

FEMALES

          High School Rank (N=989)          Aptitude Test (N=940)

                   Chances in 100 of a               Chances in 100 of a
                   freshman obtaining an             freshman obtaining an
                   average grade of:                 average grade of:
          % of     C or       B or          % of     C or       B or
%ile      class    Higher     Higher        class    Higher     Higher


80-99 


53 


92 


hO 


36 




92 


A3 


60-79 


30 


79 


17 


2h 
18
12


hi 


6 


20 
2 Q


20-39 
3


1 A 
1-19
-


6 
-








MALES

          High School Rank (N=1067)         Aptitude Test (N=1029)

                   Chances in 100 of a               Chances in 100 of a
                   freshman obtaining an             freshman obtaining an
                   average grade of:                 average grade of:
          % of     C or       B or          % of     C or       B or
%ile      class    Higher     Higher        class    Higher     Higher


80-99 


hi 


90 


h3 


28 




91 
73


16 
25
57


5 


25 




50 


16 
5


30 


h 


16 




59 


6 


1-19 


1 


25 


8 
55


6 


* the number of students in this cell is not large enough to produce a
  reliable percentage

- no students in this cell







Figure 1. Standard error of measurement in relation to observed
score distribution.

Figure 2. Common score scales and the normal distribution.

Figure 4. Frank's ITED scores.

Figure 5. SAT-HS English score distributions for state and a local
group.

Figure 6. Relation of standard error of estimate to criterion
standard deviation.